Identification of short open reading frames in plant genomes

The roles of short/small open reading frames (sORFs) have been increasingly recognized in recent years, as a rapidly growing number of sORFs have been identified in various organisms following the development and application of the Ribo-Seq technique, which sequences the ribosome-protected footprints (RPFs) of translating mRNAs. However, special attention should be paid to the use of RPFs for sORF identification in plants because of their small size (~30 nt) and the high complexity and repetitiveness of plant genomes, particularly in polyploid species. In this work, we compare different approaches to the identification of plant sORFs, discuss the advantages and disadvantages of each method, and provide a guide for choosing among methods in plant sORF studies.


Introduction
Short/small open reading frames (sORFs) with the capacity to encode micropeptides shorter than 100 amino acids (aa) are widely distributed in plants, ranging from green algae (Xu et al., 2017a) to rice (Yang et al., 2021) and Arabidopsis (Hanada et al., 2007), and are engaged in various biological and molecular processes, such as plant growth, nitrogen response, symbiotic nitrogen fixation, stomatal closure, the plant circadian clock, anther development, pollen tube growth, abiotic responses, plant disease resistance, morphogenesis, and growth regulation (Ong et al., 2022; Wu et al., 2022). sORFs are pervasive in plant genomes and have been detected beyond the known coding regions. According to their locations, sORFs can be classified into several groups, including uORFs, dORFs, lncRNA-sORFs, and intergenic-sORFs (Table 1). As in mammals and yeasts (Jeon et al., 2015; Libre et al., 2021), sORFs have also been reported in plants, where the translation of uORFs can potentially repress the translation of downstream major ORFs (mORFs) (David-Assael et al., 2005; Saul et al., 2009; Pajerowska-Mukhtar et al., 2012; Causier et al., 2022). Another function of sORFs is to modulate the abundance of miRNAs and thereby influence the translation of their target mRNAs. Lauressergues et al. (2015) reported two plant primary miRNA transcripts (pri-miRNAs) that contain sORFs encoding regulatory peptides, miPEP165a and miPEP171b, in A. thaliana and Medicago truncatula, respectively. Overexpressing miPEP171b in M. truncatula roots specifically increases the accumulation of endogenous mature miRNAs, resulting in a reduction in lateral root density to a similar extent as overexpression of the corresponding pri-miR171b. Furthermore, the peptides encoded by sORFs can themselves be functional and are involved in various biological processes (reviewed in Ong et al., 2022).
Given that sORFs can substantially regulate the translation of downstream ORFs and encode proteins with crucial functions, the mutation of many sORFs, particularly uORFs, would lead to dramatic phenotype changes in plants and crops. Therefore, the natural or artificial mutation of these sORFs can be used to improve vital plant processes and valuable crop traits (Xu et al., 2017b;Si et al., 2020).
Although reports of plant sORFs are increasing dramatically, the identification of sORFs in plant genomes remains challenging. The recent advancement and application of Ribo-Seq technology have promoted research on plant sORFs; however, existing studies have focused only on plants with simple genomes, including the model plants Arabidopsis and rice, while investigations into sORFs in complex genomes are rare. In most, if not all, existing plant sORF studies, methods and tools developed for mammals or yeasts were used directly to search for sORFs in plant genomes. Nevertheless, plant genomes are generally more repetitive, and many of them are polyploid or paleopolyploid. Studies of sORFs in plant genomes therefore require special attention.
Despite the challenges and difficulties, many efforts have been made to identify sORFs in plant genomes. The first systematic study of sORFs in plants was conducted on Arabidopsis thaliana, where more than 7,000 sORFs were identified (Hanada et al., 2007), including 49 that induced visible phenotypic effects or were associated with various morphological changes (Hanada et al., 2013). Subsequently, a total of 48,620 sORFs were identified in Oryza sativa through microarray analyses, and at least 36 were involved in Fe deficiency and excess (Bashir et al., 2014). Generally, plant sORFs can be identified using three different strategies: the conservation of coding sequences, ribosome-protected footprints, and nucleotide diversity in the natural population. In this review, we summarize the methods of sORF identification used in plant studies and discuss the challenges and caveats in plant sORF identification and possible solutions for future studies.

sORF identification through sequence conservation
As functional ORFs are conserved across genomes, early attempts at sORF identification were primarily based on sequence similarity using sequence alignment tools, such as BLAST (Altschul et al., 1997), coupled with ORF Finder (Wheeler et al., 2002), on the assumption that functional sORFs would also have been preserved by natural selection (Figure 1A). For example, a total of 26 groups of conserved uORFs were identified in O. sativa and A. thaliana using uORF-Finder (Hayden and Jorgensen, 2007). In another study, sequence similarity was also observed at the amino acid level of uORFs across Arabidopsis and rice: 11 rice uORFs were conserved in Arabidopsis, and most of them were also conserved in other cereals (Tran et al., 2008). These conserved sORFs are likely to be genuine; however, recent evidence has revealed varying degrees of sORF conservation. Although many sORFs are conserved, there are also many species- or lineage-specific sORFs (Hsu et al., 2016; Wang et al., 2021), which cannot be identified through sequence comparisons.
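The ORF-scanning step that precedes any conservation analysis can be illustrated with a minimal sketch. It is not the algorithm of ORF Finder or uORF-Finder; the function names, the ATG-only start rule, and the `min_codons` default are assumptions made for illustration, while the 100-codon upper bound follows the sORF definition given in the Introduction.

```python
# Minimal sketch: enumerate candidate sORFs on both strands of a DNA sequence.
# Assumptions (not from the source): ATG-only starts, min_codons default of 10.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sorfs(seq, min_codons=10, max_codons=100):
    """Return (strand, start, end, nt_sequence) for each candidate sORF.

    start/end are 0-based, end-exclusive coordinates on the scanned strand.
    Codon counts include the start codon and exclude the stop codon.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG opens a candidate ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if min_codons <= n_codons <= max_codons:
                        hits.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return hits


example = "CC" + "ATG" + "GCT" * 12 + "TAA" + "GG"
candidates = find_sorfs(example)
```

Candidates produced this way would still need to be compared across species (for example with BLAST) before any claim of conservation is made, and lineage-specific sORFs would pass the scan but fail that comparison.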

sORF identification using ribosome-protected footprints
Ribo-seq is an emerging technology that enables the identification of sORFs (Ingolia, 2010). To date, most plant sORFs have been identified using RPFs. Briefly, ribosome-associated mRNAs are digested by RNase, and the fragments bound by ribosomes are protected; they can then be isolated and sequenced using next-generation sequencing (NGS) technology. As ribosomes move along mRNA strands in steps of three nucleotides during translation, the mapping loci of RPFs on mRNAs show a strong 3-nt periodicity, which provides information for identifying the translated frames on mRNAs (Figure 1B). To better analyze and mine the information in RPFs, many computational tools implementing different algorithms have been developed to predict sORFs or ORFs on noncoding RNAs (see review of Ong et al., 2022), including FLOSS (Ingolia et al., 2014), RiboTaper, RiboCode (Xiao et al., 2018), ORFquant (Calviello et al., 2020), RiboNT (Song et al., 2021), and slORFfinder (Song et al., 2023). ORFScore and RiboTaper are the most frequently used tools in plant sORF studies. For example, Hsu et al. (2016) detected 187 uORFs, 10 dORFs, and 27 translated sORFs from annotated non-coding RNAs in Arabidopsis using RiboTaper, and Wu et al. (2020) detected 1,406 and 1,153 dORFs in human cells and zebrafish embryos using ORFScore.
As a caveat, most of these tools were originally developed for studies of mammals or yeasts, the genomes of which are much simpler than those of plants; the challenges in plant studies were not fully considered. sORF identification using RPFs is highly dependent on the accurate mapping of RPFs, and the Ribo-Seq strategy has some inherent shortcomings when applied to plants because of the short lengths of RPFs (~30 nt) and the complexity and repetitiveness of plant genomes. It is difficult to accurately map short RPFs to the loci they are derived from in a complex and repetitive genome. In most existing studies, the solution to this problem is either to remove RPFs with multiple hits or to randomly retain only one of the hits (Hsu et al., 2016; Bazin et al., 2017; Erhard et al., 2018). However, both procedures introduce errors in ORF identification and result in ORFs being missed. Recently, a protocol was reported for profiling the footprints of two closely packed ribosomes (disomes), which doubles the size of the footprints (Arpat et al., 2020). Disome RPFs (~60 nt) can partly alleviate the mapping problem caused by short read lengths; nevertheless, they are still too short to solve it completely, particularly in polyploid genomes. Furthermore, only ~10% of the ribosomes can be captured in disomes, with a significant bias towards rRNA and sequences encoding signal peptides; whether they can be used to identify sORFs genome-wide has not been tested. It is possible to increase the size of RPFs to ~90 nt by profiling the footprints of trisomes, but their representativeness should be evaluated before they can be used in sORF identification.

Table 1 (partial)
… (Wu et al., 2020)
lncRNA-sORFs | sORFs located in long noncoding RNAs | Plant growth | (Yang et al., 1993)
Intergenic-sORFs | sORFs located in an intergenic region | Plant growth | (Frank and Smith, 2002)
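The 3-nt periodicity signal that RPF-based tools exploit can be sketched as a simple frame test. This is a deliberately simplified illustration and not the implementation of any tool named in this section; the function names, the input format (a list of inferred P-site positions), and the use of a chi-square test with a 5.99 cutoff are assumptions.

```python
from collections import Counter


def frame_counts(psite_positions, orf_start, orf_end):
    """Count P-site positions in frames 0, 1 and 2 relative to orf_start.

    Coordinates are 0-based and end-exclusive; positions outside the
    candidate ORF are ignored.
    """
    counts = Counter(
        (p - orf_start) % 3
        for p in psite_positions
        if orf_start <= p < orf_end
    )
    return [counts.get(frame, 0) for frame in range(3)]


def periodicity_score(counts):
    """Chi-square statistic against a uniform spread over the three frames."""
    total = sum(counts)
    if total == 0:
        return 0.0
    expected = total / 3
    return sum((c - expected) ** 2 / expected for c in counts)


def is_translated(counts, threshold=5.99):
    """Call a candidate translated if frame 0 dominates and the skew is strong.

    5.99 approximates the chi-square critical value for 2 degrees of
    freedom at p = 0.05 (an illustrative cutoff, not a recommended one).
    """
    return periodicity_score(counts) > threshold and counts[0] == max(counts)


psites = [100, 103, 106, 109, 112, 115, 118, 121, 124, 101]
counts = frame_counts(psites, orf_start=100, orf_end=130)
```

Because the test only sees reads that were assigned to the candidate locus, discarding or randomly assigning multi-mapped RPFs changes `counts` directly, which is why read mapping matters so much in repetitive genomes.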

sORF identification using degradome sequencing
Degradome sequencing is a high-throughput method that was originally used for the identification of endogenous siRNA and miRNA targets by combining modified rapid amplification of 5' cDNA ends (5'-RACE) with NGS technology (Addo-Quaye et al., 2008). Briefly, 5'-3' exoribonuclease degrades translating mRNAs up to the last ribosome, which translocates codon by codon, leaving a set of truncated transcripts with free 5' monophosphates whose lengths differ in steps of 3 nt. After sequencing on an NGS platform, this 3-nucleotide periodicity in the positions of free 5' mRNA ends can be revealed by mapping reads to mRNAs (Bertoni, 2016), which provides an alternative approach for sORF identification (Figure 1C) (Hou et al., 2016). Using genome-wide mapping of truncated mRNAs, Yu et al. (2016) discovered a 3-nt periodicity pattern throughout ORFs in Arabidopsis leaf samples, and an accumulation of cleavage events 16 to 17 nucleotides upstream of the stop codons of both ORFs and uORFs was also observed because of ribosomal pausing during translation termination. These results, therefore, make it possible to search for potential sORFs (Yu et al., 2016). In another study, 3-nt periodicity was also observed, not only in Arabidopsis but also in Glycine max and Oryza sativa, and both novel and known uORFs were identified by searching for the accumulation of 5' RNA ends peaking upstream of the stop codons (Hou et al., 2016). While degradome-based ORF prediction relies on the 3-nt periodicity of the mapping positions of NGS reads on mRNAs, truncated RNA fragments are much longer than RPFs, which helps to resolve the mapping problems in complex genomes caused by the short lengths of RPFs (Carpentier et al., 2021). However, as degradome sequencing was originally developed to identify the targets of sRNAs, the binding of sRNAs can cause degradome reads to accumulate outside the translated frames, which might introduce unexpected errors in the prediction of ORFs. Although ORF prediction from degradome reads is similar to that from RPFs, whether RPF-based tools can also be used to predict ORFs from degradome reads has rarely been tested. To better utilize degradome datasets, more tools need to be developed or tested for sORF prediction in the future.

Figure 1. Graphic illustration of four different strategies of sORF identification. (A) Sequence conservation-based. It is assumed that functional sORFs are preserved by natural selection and are conserved across species; sORF identification can be based on sequence similarities using sequence alignment tools, such as BLAST. (B) RPF-based. The ribosome-associated mRNAs are digested by RNase and the fragments bound by ribosomes are protected; they can then be isolated and sequenced using next-generation sequencing (NGS) technology. As ribosomes move along mRNA strands in steps of three nucleotides during translation, the mapping loci of RPFs on mRNAs show a strong 3-nt periodicity. (C) Degradome sequencing-based. 5'-3' exoribonuclease digests translating mRNAs, chasing the last translating ribosome, which translocates codon by codon on mRNAs, leaving truncated 5' monophosphate mRNAs whose lengths differ in steps of 3 nt. After sequencing, this 3-nucleotide periodicity in the positions of free 5' mRNA ends can then be used for sORF identification. (D) Natural nucleotide diversity-based. As only the nucleotide diversities in CDSs show a significant 3-nt periodicity, the single-nucleotide polymorphism (SNP) datasets of natural populations can be used to predict sORFs.
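The stop-codon signature described in this section (an accumulation of degradome 5' ends 16 to 17 nt upstream of stop codons) can be turned into a simple enrichment score. The following is a hypothetical sketch rather than the procedure of Yu et al. (2016) or Hou et al. (2016): the function name, the 30-nt background window, the pseudocount, and the convention that offsets are measured from the first nucleotide of the stop codon are all assumptions.

```python
def stop_peak_enrichment(five_prime_counts, stop_pos, offsets=(16, 17), flank=30):
    """Ratio of 5'-end density at the expected pause site to local background.

    five_prime_counts: dict mapping transcript position (0-based) to the
        number of degradome reads whose 5' end maps there.
    stop_pos: 0-based position of the first nucleotide of a candidate
        stop codon (assumed convention).
    offsets: distances upstream of stop_pos where the peak is expected.
    flank: size of the upstream window used to estimate background.
    """
    peak_positions = [stop_pos - o for o in offsets]
    peak = sum(five_prime_counts.get(p, 0) for p in peak_positions) / len(peak_positions)

    # Background: mean 5'-end count in the upstream window, peak sites excluded.
    window = [
        p for p in range(stop_pos - flank, stop_pos)
        if p >= 0 and p not in peak_positions
    ]
    if not window:
        return 0.0
    background = sum(five_prime_counts.get(p, 0) for p in window) / len(window)

    # A pseudocount of 1 keeps the ratio finite when the background is empty.
    return peak / (background + 1.0)


ends = {83: 20, 84: 30, 70: 2, 90: 2}
score = stop_peak_enrichment(ends, stop_pos=100)
```

A high score at an in-frame stop codon supports, but does not prove, translation of the upstream ORF, since sRNA-guided cleavage can also produce local pile-ups of 5' ends.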

sORF identification using nucleotide diversity
The third nucleotides in codons are wobble nucleotides and are therefore subject to more relaxed purifying selection in nature (Hurst, 2002), resulting in higher nucleotide diversity at every third nucleotide in coding sequences (Jiang et al., 2022). This pattern resembles the 3-nt periodicity of RPFs on mRNAs and can therefore also be used to predict ORFs (Figure 1D). Jiang et al. (2022) recently developed a pipeline, OrfPP, to predict ORFs using the single-nucleotide polymorphism (SNP) datasets of natural populations and applied it to two polyploid species: tetraploid cotton (Gossypium hirsutum) and hexaploid wheat (Triticum aestivum). As SNPs in most studies are called using 100 or 150 bp paired-end reads, this strategy can overcome the difficulties caused by the short lengths of RPFs in plant studies. Although SNP calling may also introduce errors caused by the mismapping or multiple mapping of short reads, this problem may eventually be solved by the application of long reads in plant population studies. Indeed, long-read techniques have been used increasingly to detect genomic variants in natural populations of plants (Dorfner et al., 2022; Tang et al., 2022; Zhou et al., 2022). Another advantage of this method is that it makes direct use of existing SNP datasets and requires no extra experiments, such as the construction of Ribo-Seq libraries, which can be costly and technically challenging in some organisms or tissues; it therefore allows the large-scale identification of sORFs in plants with genome-wide SNP datasets. The SNP-based strategy can be used in complex genomes as long as their SNP datasets are available. Nevertheless, genome sequencing and assembly are also challenging and costly for these species, many of which do not have genome-wide SNP datasets either. The lack of SNPs prevents the application of an SNP-based approach in these species. Fortunately, for most polyploid crops, such as oilseed, cotton, and wheat, the reference genome assembly and population resequencing have been completed, which can facilitate the identification of sORFs using the SNP-based approach.
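The diversity signal behind the SNP-based strategy can be illustrated by summing per-site diversity by codon position within a candidate ORF. This sketch is not OrfPP; the function names, the input format (position mapped to alternate allele frequency), and the use of expected heterozygosity 2p(1-p) for biallelic SNPs as the per-site diversity measure are assumptions.

```python
def site_diversity(alt_freq):
    """Expected heterozygosity 2p(1-p) of a biallelic SNP with allele frequency p."""
    return 2.0 * alt_freq * (1.0 - alt_freq)


def diversity_by_codon_position(snps, orf_start, orf_end):
    """Sum per-site diversity for codon positions 1, 2 and 3 of a candidate ORF.

    snps: dict mapping genomic position (0-based) to alternate allele frequency.
    orf_start/orf_end: 0-based, end-exclusive coordinates of the candidate.
    Returns a list [pos1_total, pos2_total, pos3_total].
    """
    totals = [0.0, 0.0, 0.0]
    for pos, freq in snps.items():
        if orf_start <= pos < orf_end:
            totals[(pos - orf_start) % 3] += site_diversity(freq)
    return totals


def wobble_fraction(totals):
    """Share of the total diversity carried by the third codon position."""
    total = sum(totals)
    return totals[2] / total if total > 0 else 0.0


snps = {100: 0.1, 102: 0.5, 105: 0.5, 200: 0.5}
totals = diversity_by_codon_position(snps, orf_start=100, orf_end=130)
fraction = wobble_fraction(totals)
```

For a non-coding region the three totals would be expected to be roughly equal (a fraction near one third), whereas a coding frame under purifying selection should show an excess at the third position. A short sORF contains few SNPs, so any real test would need a population large enough to provide adequate counts.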

Discussion
Although several approaches have been developed to identify sORFs, each has its own limitations, and caution should be taken in applying them. Sequence conservation-based methods can identify only old and conserved sORFs and are powerless to identify young sORFs. RPF-based methods can use only short reads, which results in incorrect mapping to genomes, and this problem is even worse in a complex and repetitive genome. The degradome and SNP-based approaches take advantage of longer reads to produce high-quality unique mapping and should lead to better sORF prediction. Both RPF- and degradome-based identification are affected by the reads captured in the experiments, so silent or weakly translated sORFs might be missed in these data, and the results can vary substantially across tissues or growth conditions. In contrast, the SNP-based strategy relies on the preparation of a high-quality library and can predict both active and inactive sORFs. Cross-identification using different approaches has also been reported. Of the 89 conserved Arabidopsis sORFs, 39 were successfully identified by RPFs (Hsu et al., 2016). More than a quarter of the sORFs predicted from SNPs, which were actively translated, overlapped with those predicted using RPFs (Jiang et al., 2022). Thus, the SNP-based strategy is an effective approach to extending the study of sORFs, especially in complex genomes, but it requires the accumulation of nucleotide diversity in natural populations, and its accuracy is also affected by the quality of the reference genome and SNP datasets. In practice, these approaches are mutually complementary and can be chosen for different purposes. Overall, it has become increasingly clear that sORFs play important roles in various plant processes and are potential candidates for crop improvement.
Although the application of Ribo-Seq in plant studies has substantially enhanced the understanding of plant sORFs, most of these studies were conducted in model species, such as Arabidopsis (Hsu et al., 2016; Mahboubi et al., 2021) and rice (Su et al., 2018; Xu et al., 2021), or crops with small and simple genomes, such as tomato (Wu et al., 2019); the sORFs in other plants, particularly polyploid species, are poorly investigated. Given that approximately 47-70% of angiosperm species are polyploid (Masterson, 1994), more advanced techniques and algorithms are required in the future to enhance the understanding of plant sORFs.

Author contributions
JZ and WY conceived the idea. YF and MJ performed the literature search and data collection. YF and JZ wrote the manuscript. JZ and WY revised the manuscript. All authors contributed to the article and approved the submitted version.